Regression Model Builder
Description
The Regression Model Builder step builds a regression model from training data.
Configurations
No. | Field Name | Description |
---|---|---|
1 | Step Name | Specify the name of the step. Step names should be unique within a workflow. |
2 | Number of Rows to Process | Select the number of rows that you want to process. Available options are: - All - Batch This setting governs whether all rows of the dataset are passed in one shot or in batches. If you are building a model on a very large dataset, Batch row processing is typically preferable. |
3 | Size | Select the batch size of the dataset. For example, if your dataset has 50,000 rows, 1,000 can be a good batch size. Note: You can specify a batch size only if you have selected Batch in the Number of Rows to Process field. |
4 | Build Using AE Model Version | Specify the AE model version. Available options are: Version 1.0 [Python 3.6] and Version 2.0 [Python 3.8] |
5 | File name | Specify the name of the file that contains the model. |
6 | Algorithm | Select the algorithm to build the model. Available algorithms are: - Linear Regression - Random Forest Regression - Support Vector Regression |
7 | Tuning Algorithms | Select the hyperparameter tuning algorithm. Note: Only Grid Search is currently supported. |
8 | Algorithm Parameters* | Provide or select the algorithm parameters. Note: The parameters available in the Algorithm Parameters field depend on the selected algorithm. For more details, see Algorithms. |
Field Mapping Tab | ||
1 | Name | Specify the name of the input field that needs to be passed for model building. |
2 | Incoming Type | Specify the data type of the field. The data type can be either string or number. |
3 | Text Preprocessing | All the regression algorithms work on vectors of numbers. Fields of type String need to be converted internally to numeric vectors, and this cell lets you specify the text processing attributes for that field. The cell can be clicked only for fields with the String data type. The dialog that opens has two tabs. - The first tab lets you specify one or more text processing options: - Remove Punctuation: removes standard punctuation marks from the text. - Remove Stop Words: removes stop words such as ‘the’, ‘as’, ‘in’, and so on. - Additional Stop Words: choose a plain text file with one additional stop word per line. These are your domain-specific stop words. - Lemmatization: converts words to their dictionary form, such as mice to mouse and houses to house. - Stemming: reduces a word to its stem regardless of the word form used in the text; going, went, and goes are all converted to go. - The second tab lets you test your text processing options. Type any text in the box next to ‘Value:’ and click the ‘Test’ button; the box next to ‘Result:’ shows the text after your selected options are applied. A sketch of these preprocessing options appears after this table. |
4 | Get Fields | Click to get values from the previous step. |
5 | Class/Target Field | Specify the target field for the regression. |
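The step applies these preprocessing options internally; the following is a minimal sketch of what they do, assuming NLTK as the underlying library. The step's actual implementation is not documented here, so the library choice and the `preprocess` helper below are illustrative assumptions, not the step's code.

```python
# Minimal sketch of the Text Preprocessing options, assuming NLTK.
# Illustrative only; the step's internal implementation is not documented.
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)


def preprocess(text, extra_stop_words=(), lemmatize=False, stem=False):
    # Remove Punctuation: strip standard punctuation marks.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.lower().split()
    # Remove Stop Words, plus any domain-specific Additional Stop Words.
    stop = set(stopwords.words("english")) | set(extra_stop_words)
    tokens = [t for t in tokens if t not in stop]
    if lemmatize:  # dictionary form: mice -> mouse, houses -> house
        wnl = WordNetLemmatizer()
        tokens = [wnl.lemmatize(t) for t in tokens]
    if stem:  # stem of the word: going -> go
        ps = PorterStemmer()
        tokens = [ps.stem(t) for t in tokens]
    return tokens


print(preprocess("The mice are going home!", lemmatize=True))
# ['mouse', 'going', 'home']
```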
When you process a feature of type String, as mentioned in the Text Preprocessing row of the table above, the feature needs to be converted into numeric features. The Text Vectorization tab governs how all String features are converted into numeric features. An n-gram is a contiguous sequence of n items from a given sample of text or speech. The table below shows how a string is tokenized internally for different n-gram ranges.
String | N-Gram Start/End | Tokens |
---|---|---|
Weather today is good | 1-1 | 'Weather', 'today', 'good' |
Weather today is good | 1-2 | 'Weather', 'today', 'good', 'Weather today', 'today good' |
Weather today is good | 1-3 | 'Weather', 'today', 'good', 'Weather today', 'today good', 'Weather today good' |
Weather today is good | 2-3 | 'Weather today', 'today good', 'Weather today good' |
Note: ‘is’ is treated as a stop word and is not considered.
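The tokenization in the table above can be reproduced with scikit-learn's CountVectorizer, whose ngram_range parameter plays the role of the N Gram start/end fields below. This is an illustrative sketch; whether the step uses this exact class internally is an assumption.

```python
# Sketch: reproducing the n-gram table with scikit-learn's CountVectorizer.
# Illustrative only; the step's internal tokenizer is not documented here.
from sklearn.feature_extraction.text import CountVectorizer

text = ["Weather today is good"]
for start, end in [(1, 1), (1, 2), (1, 3), (2, 3)]:
    # stop_words='english' drops 'is' before n-grams are built,
    # matching the note under the table.
    vec = CountVectorizer(ngram_range=(start, end), stop_words="english")
    vec.fit(text)
    print(f"{start}-{end}:", list(vec.get_feature_names_out()))
# 1-1: ['good', 'today', 'weather']
# 1-2: ['good', 'today', 'today good', 'weather', 'weather today']
# ...
```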
No. | Field Name | Description |
---|---|---|
Text Vectorization Tab | ||
1 | N Gram start | Specify a numeric value with a minimum of 1. |
2 | N Gram end | Specify a numeric value greater than or equal to N Gram start. |
3 | Vectorization | The n-gram operation tokenizes an input String feature. Vectorization is the operation in which these tokens are converted to the numeric features needed by the algorithms. Three types of vectorizers are supported: - Count Vectorizer: counts the number of times a token appears in the document and uses this value as its weight. - Tfidf Vectorizer: TF-IDF stands for “term frequency-inverse document frequency”; the weight assigned to each token depends not only on its frequency in a document but also on how recurrent the term is in the entire corpus. - Hashing Vectorizer: designed to be as memory-efficient as possible. Instead of storing the tokens as strings, the vectorizer applies the hashing trick to encode them as numerical indexes. The downside of this method is that, once vectorized, the feature names can no longer be retrieved. A sketch of these vectorizers, together with the Train/Test Split below, appears after this table. |
Evaluation Tab | ||
1 | Evaluation Type | Choose an evaluation algorithm type from the drop-down list. - None: choose None if evaluation is not needed. - Train/Test Split: splits the data into a training set and a test set as per the parameters below. The training set contains known outputs, and the model learns on this data so that it can generalize to other data later. The test set is used to evaluate the model’s predictions on data it has not seen. |
2 | Test Percentage | For Train/Test Split. Data types allowed: float, int, or None, optional (default=None). - If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. - If int, represents the absolute number of test samples. - If None, it is set to 0.25. |
3 | Random State | For Train/Test Split. Data types allowed: int, RandomState instance, or None, optional (default=None). - If int, random_state is the seed used by the random number generator. - If RandomState instance, random_state is the random number generator. - If None, the random number generator is the RandomState instance used by np.random. |
4 | Evaluation Output File Name | Specify the absolute path of the HTML report output file. |
5 | Add output filename to result | Enable the checkbox to display a downloadable link to the HTML report output file on the AE portal. |
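The vectorizers and the Train/Test Split evaluation described above correspond to well-known scikit-learn APIs. The sketch below shows those calls; the assumption that the step wraps these exact classes is illustrative, and the sample documents and targets are made up.

```python
# Sketch: the three vectorizers and the Train/Test Split evaluation,
# expressed as the scikit-learn calls they correspond to. Assumes the
# step wraps these; the actual implementation is not documented here.
from sklearn.feature_extraction.text import (
    CountVectorizer, HashingVectorizer, TfidfVectorizer)
from sklearn.model_selection import train_test_split

docs = ["weather today is good", "weather tomorrow is bad",
        "rain is likely today", "sunny and warm tomorrow"]
targets = [1.0, 0.0, 0.3, 0.9]  # made-up regression targets

X_count = CountVectorizer().fit_transform(docs)   # token counts as weights
X_tfidf = TfidfVectorizer().fit_transform(docs)   # counts reweighted by IDF
X_hash = HashingVectorizer(n_features=2**10).fit_transform(docs)
# HashingVectorizer stores no vocabulary, so feature names cannot be
# recovered after vectorization (the downside noted above).

# Train/Test Split with the Test Percentage / Random State semantics
# described above: a float test_size is a proportion, an int is an
# absolute sample count, and None defaults to 0.25.
X_train, X_test, y_train, y_test = train_test_split(
    X_count, targets, test_size=0.25, random_state=42)
```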
Algorithms The following table lists the algorithms along with a description and the corresponding parameters.
No. | Algorithm Description | Algorithm Parameter Description |
---|---|---|
1 | Linear Regression: LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation. | NA |
2 | Random Forest Regression: A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default); otherwise the whole dataset is used to build each tree. | n_estimators: int, default=100 The number of trees in the forest. criterion: {“squared_error”, “absolute_error”, “poisson”}, default=”squared_error” The function to measure the quality of a split. Supported criteria are “squared_error” for the mean squared error, which is equal to variance reduction as feature selection criterion, “absolute_error” for the mean absolute error, and “poisson” which uses reduction in Poisson deviance to find splits. Training using “absolute_error” is significantly slower than when using “squared_error”. max_depth: int, default=None The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. min_samples_split: int or float, default=2 The minimum number of samples required to split an internal node: - If int, then consider min_samples_split as the minimum number. - If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split. min_samples_leaf: int or float, default=1 The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. - If int, then consider min_samples_leaf as the minimum number. - If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node. min_weight_fraction_leaf: float, default=0.0 The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided. max_features: {“auto”, “sqrt”, “log2”}, int or float, default=”auto” The number of features to consider when looking for the best split: - If “auto”, then max_features=n_features. - If “sqrt”, then max_features=sqrt(n_features). - If “log2”, then max_features=log2(n_features). - If None, then max_features=n_features. Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features. max_leaf_nodes: int, default=None Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, then there is an unlimited number of leaf nodes. min_impurity_decrease: float, default=0.0 A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following: N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity) where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child. N, N_t, N_t_R, and N_t_L all refer to the weighted sum if sample_weight is passed. bootstrap: bool, default=True Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree. oob_score: bool, default=False Whether to use out-of-bag samples to estimate the generalization score. Only available if bootstrap=True. n_jobs: int, default=None The number of jobs to run in parallel. fit, predict, decision_path, and apply are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context; -1 means using all processors. ccp_alpha: non-negative float, default=0.0 Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. max_samples: int or float, default=None If bootstrap is True, the number of samples to draw from X to train each base estimator. - If None (default), then draw X.shape[0] samples. - If int, then draw max_samples samples. - If float, then draw max_samples * X.shape[0] samples; thus, max_samples should be in the interval (0.0, 1.0). |
3 | Support Vector Regression: The implementation is based on libsvm. The fit time complexity is more than quadratic in the number of samples, which makes it hard to scale to datasets with more than a few tens of thousands of samples. For large datasets consider using LinearSVR or SGDRegressor instead, possibly after a Nystroem transformer. | kernel: {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’} or callable, default=’rbf’ Specifies the kernel type to be used in the algorithm. If none is given, ‘rbf’ will be used. If a callable is given, it is used to precompute the kernel matrix. degree: int, default=3 Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels. gamma: {‘scale’, ‘auto’} or float, default=’scale’ Kernel coefficient for ‘rbf’, ‘poly’, and ‘sigmoid’. - If gamma='scale' (default), uses 1 / (n_features * X.var()) as the value of gamma; if ‘auto’, uses 1 / n_features. coef0: float, default=0.0 Independent term in the kernel function. It is only significant in ‘poly’ and ‘sigmoid’. C: float, default=1.0 Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty. epsilon: float, default=0.1 Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value. shrinking: bool, default=True Whether to use the shrinking heuristic. max_iter: int, default=-1 Hard limit on iterations within the solver, or -1 for no limit. |
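As a rough illustration, the sketch below constructs each of the three algorithms with parameters from the table above, and runs a Grid Search (the step's supported tuning algorithm) over a small Random Forest parameter grid. It assumes the step wraps scikit-learn estimators; the parameter values and grid are illustrative only.

```python
# Sketch: the three algorithms with parameters from the table above, plus
# a Grid Search over Random Forest hyperparameters (the step's Tuning
# Algorithms field). Assumes scikit-learn; illustrative values only.
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

models = {
    "Linear Regression": LinearRegression(),  # no tunable parameters listed
    "Random Forest Regression": RandomForestRegressor(
        n_estimators=100, criterion="squared_error",
        max_depth=None, min_samples_split=2, bootstrap=True),
    "Support Vector Regression": SVR(
        kernel="rbf", C=1.0, epsilon=0.1, gamma="scale"),
}

# Grid Search exhaustively evaluates every parameter combination by
# cross-validation and keeps the best-scoring estimator.
grid = GridSearchCV(
    RandomForestRegressor(),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    scoring="neg_mean_squared_error", cv=3)
# grid.fit(X_train, y_train); grid.best_params_  # with your training data
```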
Limitations:
Users may get a value conversion error when the count of fields in the Microsoft Excel Input step differs from the count of fields that need to be passed to the ML: Model Builder step. The error occurs because of incorrect data type conversion in the Microsoft Excel Input step.
The workaround is either of the following:
- Ensure that the Microsoft Excel Input step has the same fields as those required in the ML: Model Builder step.
- Ensure the data type of all fields is String.